Abstract

The goal of temporal alignment is to establish time correspondence between two sequences, which has many 
applications in a variety of areas such as speech processing, bioinformatics, computer vision, and computer 
graphics. In this paper, we propose a novel temporal alignment method called least-squares dynamic time 
warping (LSDTW). LSDTW finds an alignment that maximizes statistical dependency between sequences,
measured by a squared-loss variant of mutual information. The benefit of this novel information-theoretic 
formulation is that LSDTW can align sequences with different lengths, different dimensionality, high non- 
linearity, and non-Gaussianity in a computationally efficient manner. In addition, model parameters such 
as an initial alignment matrix can be systematically optimized by cross-validation. We demonstrate the 
usefulness of LSDTW through experiments on synthetic and real-world Kinect action recognition datasets. 

1 Introduction 

Temporal alignment of sequences is an important problem with many practical applications such as speech 
recognition [1] [2], activity recognition [3] [4], temporal segmentation [5], curve matching [6], chromatographic and
micro-array data analysis [7], synthesis of human motion [8], and temporal alignment of human motion [9] [10].

Dynamic time warping (DTW) is a classical temporal alignment method that aligns two sequences by
minimizing the pairwise squared Euclidean distance [1] [2]. An advantage of DTW is that the minimization can be
efficiently carried out by dynamic programming (DP) [11]. However, due to the Euclidean formulation, DTW
may not be able to find a good alignment when the characteristics of the two sequences are substantially different 
(e.g., sequences have different amplitudes). Moreover, DTW cannot handle sequences with different dimensions 
(e.g., image to audio alignment), which limits the range of applications significantly. 

To overcome the weaknesses of DTW, canonical time warping (CTW) was introduced [9]. CTW performs 
sequence alignment in a common latent space found by canonical correlation analysis (CCA) [12]. Thus, CTW
can naturally handle sequences with different dimensions. However, CTW can only deal with linear projections, 
and it is difficult to optimize model parameters such as the initial alignment matrix, the regularization parameter 
used in CCA, and the dimensionality of the common latent space. 

To handle non-linearity, dynamic manifold temporal warping (DMTW) was recently proposed in [4] . DMTW 
first transforms original data onto a one-dimensional non-linear manifold and then finds an alignment on this 
manifold using DTW. Although DMTW is highly flexible by construction, its performance depends heavily on 
the choice of the non-linear transformation and, moreover, it implicitly assumes the smoothness of sequences. 
For this reason, DMTW has limited applicability. 

In this paper, we propose a novel information-theoretic temporal alignment method based on statistical de- 
pendence maximization. Our method, which we call least-squares dynamic time warping (LSDTW), employs 
a squared-loss variant of mutual information called squared-loss mutual information (SMI) as a dependency 
measure. SMI is estimated by the method of least-squares mutual information (LSMI) [13], which is a consistent
estimator achieving the optimal non-parametric convergence rate to the true SMI. An advantage of the proposed
LSDTW over existing methods is that it can naturally deal with non-linearity and non-Gaussianity in data 
through SMI. Moreover, cross-validation (CV) with respect to the LSMI criterion is possible, which allows selec- 
tion of model parameters such as the initial alignment matrix, the Gaussian kernel width, and the regularization 
parameter. Furthermore, the formulation of LSDTW is quite general and does not require strong assumptions on 
the topology of the latent manifold (e.g., smoothness). Thus, LSDTW is expected to perform well in a broader 
range of applications. Indeed, through experiments on synthetic and real-world Kinect action recognition tasks, 
LSDTW is shown to be a promising alternative to existing temporal alignment methods. 

2 Dependence Maximizing Temporal Alignment via SMI 

In this section, we first formulate the problem of dependence maximizing temporal alignment (DMTA) and then 
develop a DMTA method based on squared-loss mutual information (SMI) [13].

2.1 Formulation of Dependence Maximizing Temporal Alignment (DMTA) 

Given two sequences, each represented by a set of samples (ordered in time),

$$\{x_i \mid x_i \in \mathbb{R}^{d_x}\}_{i=1}^{n_x} \quad \text{and} \quad \{y_j \mid y_j \in \mathbb{R}^{d_y}\}_{j=1}^{n_y},$$

the goal of DMTA is to find a temporal alignment such that the statistical dependency between the two sets of
samples is maximized. Note that $n_x$ and $d_x$ can, in general, be different from $n_y$ and $d_y$.

Let $\pi^x$ and $\pi^y$ be alignment functions over $\{1, \ldots, n_x\}$ and $\{1, \ldots, n_y\}$, and let $\Pi$ be the corresponding
alignment matrix:

$$\Pi := [\pi^x \ \pi^y]^\top \in \mathbb{R}^{2 \times m},$$
$$\pi^x := [\pi^x_1, \ldots, \pi^x_m]^\top \in \{1, \ldots, n_x\}^{m},$$
$$\pi^y := [\pi^y_1, \ldots, \pi^y_m]^\top \in \{1, \ldots, n_y\}^{m},$$

where $m$ is the number of indices needed to align the sequences and $^\top$ denotes the transpose. $\Pi$ needs to satisfy
the following three additional constraints:

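• Boundary condition: $[\pi^x_1 \ \pi^y_1]^\top = [1 \ 1]^\top$ and $[\pi^x_m \ \pi^y_m]^\top = [n_x \ n_y]^\top$.

• Continuity condition: $0 \le \pi^x_t - \pi^x_{t-1} \le 1$ and $0 \le \pi^y_t - \pi^y_{t-1} \le 1$.

• Monotonicity condition: $t_1 \ge t_2 \Rightarrow \pi^x_{t_1} \ge \pi^x_{t_2}, \ \pi^y_{t_1} \ge \pi^y_{t_2}$.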

Let us denote the paired samples aligned by $\pi^x$ and $\pi^y$ as

$$Z(\Pi) := \{(x_{\pi^x_i}, y_{\pi^y_i})\}_{i=1}^{m}.$$

Then, the optimal alignment, denoted by $\Pi^*$, is defined as the maximizer of a certain statistical dependence
measure $D$ between the two sets $\{x_i\}_{i=1}^{n_x}$ and $\{y_j\}_{j=1}^{n_y}$:

$$\Pi^* := \operatorname*{argmax}_{\Pi} D(Z(\Pi)).$$
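As a minimal illustrative sketch (not part of the proposed method), the three constraints and the pairing $Z(\Pi)$ can be checked in a few lines of Python. The function names `is_valid_alignment` and `paired_samples` are hypothetical, indices are 1-based as in the text, and the continuity condition is read here as allowing steps of 0 or 1 only.

```python
from typing import List, Sequence, Tuple


def is_valid_alignment(pi_x: Sequence[int], pi_y: Sequence[int],
                       n_x: int, n_y: int) -> bool:
    """Check the boundary, continuity and monotonicity conditions (1-based indices)."""
    if len(pi_x) != len(pi_y) or len(pi_x) == 0:
        return False
    # Boundary condition: start at (1, 1) and end at (n_x, n_y).
    if (pi_x[0], pi_y[0]) != (1, 1) or (pi_x[-1], pi_y[-1]) != (n_x, n_y):
        return False
    for t in range(1, len(pi_x)):
        step_x = pi_x[t] - pi_x[t - 1]
        step_y = pi_y[t] - pi_y[t - 1]
        # Continuity and monotonicity together: every step is 0 or 1 (assumed reading).
        if step_x not in (0, 1) or step_y not in (0, 1):
            return False
    return True


def paired_samples(xs: Sequence, ys: Sequence,
                   pi_x: Sequence[int], pi_y: Sequence[int]) -> List[Tuple]:
    """Build Z(Pi): the list of pairs (x_{pi_x[i]}, y_{pi_y[i]})."""
    return [(xs[i - 1], ys[j - 1]) for i, j in zip(pi_x, pi_y)]
```

For example, with $n_x = 3$ and $n_y = 2$, the index vectors `[1, 2, 2, 3]` and `[1, 1, 2, 2]` pass the check, whereas `[1, 3]` and `[1, 2]` fail because the first sequence skips a sample.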

2.2 Least-Squares Dynamic Time Warping (LSDTW) 

A popular measure of statistical dependence is mutual information (MI) [14], and its estimation has been studied
thoroughly [15] [16] [17] [18] [19]. However, these MI approximations are computationally expensive, mainly due
to the non-linearity introduced by the "log" function. In this paper, we propose to use a squared-loss variant of 
MI called squared-loss MI (SMI), which results in a simple and computationally efficient estimation algorithm 
called least-squares dynamic time warping (LSDTW). 



2.2.1 Overview

The optimization problem of LSDTW is defined as

$$\Pi^* := \operatorname*{argmax}_{\Pi} \mathrm{SMI}(Z(\Pi)), \qquad (1)$$



where SMI is defined and expressed as

$$\mathrm{SMI} := \frac{1}{2} \iint \left( \frac{p_{xy}(x, y)}{p_x(x) p_y(y)} - 1 \right)^2 p_x(x) p_y(y) \, dx \, dy = \frac{1}{2} \iint \frac{p_{xy}(x, y)}{p_x(x) p_y(y)} \, p_{xy}(x, y) \, dx \, dy - \frac{1}{2}, \qquad (2)$$

where $p_{xy}(x, y)$ is the joint density of $x$ and $y$, and $p_x(x)$ and $p_y(y)$ are the marginal densities of $x$ and $y$,
respectively. SMI is the Pearson divergence [20] from $p_{xy}(x, y)$ to $p_x(x) p_y(y)$, while the ordinary MI is the
Kullback-Leibler divergence [21] from $p_{xy}(x, y)$ to $p_x(x) p_y(y)$. SMI is non-negative and is zero if and only if $x$
and $y$ are statistically independent, as is the ordinary MI.
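To make the definition concrete, the following sketch evaluates the discrete analogue of Eq. (2), with sums in place of integrals, on two toy joint distributions. The discrete setting and the function name `discrete_smi` are assumptions made purely for illustration; the paper itself works with densities.

```python
from typing import List


def discrete_smi(joint: List[List[float]]) -> float:
    """Discrete analogue of SMI: 0.5 * sum_ab p_x(a) p_y(b) (p_xy(a,b) / (p_x(a) p_y(b)) - 1)^2."""
    p_x = [sum(row) for row in joint]          # marginal of the first variable
    p_y = [sum(col) for col in zip(*joint)]    # marginal of the second variable
    total = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            prod = p_x[a] * p_y[b]
            if prod > 0.0:                     # skip cells with zero marginal mass
                total += prod * (p_ab / prod - 1.0) ** 2
    return 0.5 * total


independent = [[0.25, 0.25], [0.25, 0.25]]     # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]           # the two variables always coincide
```

Here `discrete_smi(independent)` is zero because the ratio is identically one, while `discrete_smi(dependent)` is strictly positive, in line with the property stated above.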

Based on Eq. (1), we develop the following iterative algorithm for estimating $\Pi$:

(i) Initialization: Initialize the alignment matrix $\Pi$.

(ii) Dependence estimation: For the current $\Pi$, obtain an SMI estimator $\widehat{\mathrm{SMI}}(Z(\Pi))$.

(iii) Dependence maximization: Given an SMI estimator $\widehat{\mathrm{SMI}}(Z(\Pi))$, obtain the maximizing alignment $\Pi$.

(iv) Convergence check: Steps (ii) and (iii) are repeated until $\Pi$ fulfills a convergence criterion.
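The four steps above can be sketched as a generic alternating loop. This is only a skeleton under stated assumptions: `estimate` and `maximize` are hypothetical callables standing in for the dependence estimation and dependence maximization steps, and the stopping rule (stop as soon as the score no longer increases) is one possible convergence criterion, chosen to match the acceptance rule described in the dependence maximization section.

```python
from typing import Any, Callable, Tuple


def alternate(init: Any,
              estimate: Callable[[Any], Tuple[float, Any]],
              maximize: Callable[[Any], Any],
              max_iter: int = 50) -> Tuple[Any, float]:
    """Alternate estimation and maximization, keeping an update only if the score increases.

    estimate(alignment) returns (score, model); maximize(model) returns a candidate alignment.
    """
    alignment = init                                   # step (i)
    score, model = estimate(alignment)                 # step (ii)
    for _ in range(max_iter):
        candidate = maximize(model)                    # step (iii)
        cand_score, cand_model = estimate(candidate)   # re-estimate for the candidate
        if cand_score <= score:                        # step (iv): no improvement, stop
            break
        alignment, score, model = candidate, cand_score, cand_model
    return alignment, score
```

In a toy setting where the "alignment" is an integer `a`, the score is `-(a - 3) ** 2` and `maximize` proposes `a + 1`, the loop climbs from 0 to 3 and then stops.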

2.2.2 Dependence Estimation 

In the dependence estimation step, we utilize a non-parametric SMI estimator called least-squares mutual
information (LSMI) [13], which was shown to possess a superior convergence property [22]. Here, we briefly review
LSMI. 

Basic Idea: The key idea of LSMI is to directly estimate the density ratio in Eq. (2) [23],

$$r(x, y) := \frac{p_{xy}(x, y)}{p_x(x) p_y(y)},$$

from paired samples $Z(\Pi) = \{(x_{\pi^x_i}, y_{\pi^y_i})\}_{i=1}^{m}$ without going through density estimation of $p_{xy}(x, y)$, $p_x(x)$, and
$p_y(y)$. Here, the density-ratio function $r(x, y)$ is directly modeled as

$$r_{\alpha}(x, y) = \sum_{\ell=1}^{m} \alpha_\ell K(x, x_{\pi^x_\ell}) L(y, y_{\pi^y_\ell}), \qquad (3)$$

where K(x,x') and L(y,y') are kernel functions (e.g., Gaussian kernels) for x and y, respectively. 

Then, the parameter $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ is learned so that the squared error to the true density ratio is
minimized:

$$J_0(\alpha) := \frac{1}{2} \iint \left( r_\alpha(x, y) - r(x, y) \right)^2 p_x(x) p_y(y) \, dx \, dy.$$

After a few lines of calculation, $J_0$ can be expressed as

$$J_0(\alpha) = J(\alpha) + \mathrm{SMI}(Z(\Pi)) + \frac{1}{2},$$

where

$$J(\alpha) := \frac{1}{2} \alpha^\top H_\Pi \alpha - h_\Pi^\top \alpha,$$
$$H_{\Pi, \ell, \ell'} := \iint K(x, x_{\pi^x_\ell}) L(y, y_{\pi^y_\ell}) K(x, x_{\pi^x_{\ell'}}) L(y, y_{\pi^y_{\ell'}}) \, p_x(x) p_y(y) \, dx \, dy,$$
$$h_{\Pi, \ell} := \iint K(x, x_{\pi^x_\ell}) L(y, y_{\pi^y_\ell}) \, p_{xy}(x, y) \, dx \, dy.$$

Since $\mathrm{SMI}(Z(\Pi))$ is constant with respect to $\alpha$, minimizing $J_0$ is equivalent to minimizing $J$.

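Computing the Solution: Approximating the expectations in $H_\Pi$ and $h_\Pi$ included in $J$ by empirical
averages, we have the following optimization problem:

$$\min_{\alpha} \left[ \frac{1}{2} \alpha^\top \widehat{H}_\Pi \alpha - \widehat{h}_\Pi^\top \alpha + \frac{\lambda}{2} \alpha^\top \alpha \right], \qquad (4)$$

where $\lambda \alpha^\top \alpha / 2$ is the regularization term to avoid overfitting, $\lambda$ ($> 0$) is the regularization parameter, and

$$\widehat{H}_{\Pi, \ell, \ell'} := \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_{\pi^x_i}, x_{\pi^x_\ell}) L(y_{\pi^y_j}, y_{\pi^y_\ell}) K(x_{\pi^x_i}, x_{\pi^x_{\ell'}}) L(y_{\pi^y_j}, y_{\pi^y_{\ell'}}),$$
$$\widehat{h}_{\Pi, \ell} := \frac{1}{m} \sum_{i=1}^{m} K(x_{\pi^x_i}, x_{\pi^x_\ell}) L(y_{\pi^y_i}, y_{\pi^y_\ell}).$$

Differentiating Eq. (4) with respect to $\alpha$ and equating it to zero, we can obtain the optimal solution $\widehat{\alpha}_\Pi$ analytically as

$$\widehat{\alpha}_\Pi = (\widehat{H}_\Pi + \lambda I)^{-1} \widehat{h}_\Pi, \qquad (5)$$

where $I$ is the $m \times m$ identity matrix. Note that LSMI has time complexity $O(m^3)$ due to the matrix inversion.
However, when the number of training data is large, we can reduce the number of kernels in Eq. (3) to $l$ ($< m$)
by sub-sampling. With this approximation, the inverse matrix in Eq. (5) can be computed with time complexity
$O(l^3)$.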

Finally, the following SMI estimator can be obtained by taking the empirical average of Eq. (2) as

$$\widehat{\mathrm{SMI}}(Z(\Pi)) = \frac{1}{2m} \sum_{i=1}^{m} r_{\widehat{\alpha}_\Pi}(x_{\pi^x_i}, y_{\pi^y_i}) - \frac{1}{2}. \qquad (6)$$
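The estimator in Eqs. (3) to (6) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: it assumes Gaussian kernels, takes the already-paired samples $Z(\Pi)$ as two equal-length lists of vectors, uses all $m$ samples as kernel centers, and replaces the matrix inverse by a small hand-written linear solver. The names `gaussian_kernel`, `solve` and `lsmi` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def gaussian_kernel(u: Vector, v: Vector, sigma: float) -> float:
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - partial) / aug[r][r]
    return z


def lsmi(xs: Sequence[Vector], ys: Sequence[Vector],
         sigma_x: float, sigma_y: float, lam: float) -> Tuple[float, List[float]]:
    """Return (SMI estimate, alpha) for paired samples (xs[i], ys[i])."""
    m = len(xs)
    kx = [[gaussian_kernel(xs[i], xs[c], sigma_x) for c in range(m)] for i in range(m)]
    ky = [[gaussian_kernel(ys[j], ys[c], sigma_y) for c in range(m)] for j in range(m)]

    def gram(k: List[List[float]]) -> List[List[float]]:
        return [[sum(k[i][p] * k[i][q] for i in range(m)) for q in range(m)]
                for p in range(m)]

    # The double sum in H-hat factorizes into a product of an x-part and a y-part.
    gx, gy = gram(kx), gram(ky)
    h_mat = [[gx[p][q] * gy[p][q] / m ** 2 + (lam if p == q else 0.0)
              for q in range(m)] for p in range(m)]
    h_vec = [sum(kx[i][c] * ky[i][c] for i in range(m)) / m for c in range(m)]
    alpha = solve(h_mat, h_vec)                      # Eq. (5)
    # Eq. (6): (1 / 2m) sum_i r_alpha(x_i, y_i) - 1/2 equals 0.5 * h_vec . alpha - 0.5.
    smi = 0.5 * sum(a * h for a, h in zip(alpha, h_vec)) - 0.5
    return smi, alpha
```

A call such as `lsmi(xs, ys, 0.5, 0.5, 0.1)` returns the estimate together with the coefficient vector; in practice the kernel widths and the regularization parameter would be chosen by the cross-validation procedure described next rather than fixed by hand.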

Model Selection: Hyper-parameters included in the kernel functions and the regularization parameter can
be optimized by cross-validation with respect to $J$ [13], which is described below.

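First, samples $Z = \{(x_i, y_i)\}_{i=1}^{n}$ are divided into $K$ disjoint subsets $\{Z_k\}_{k=1}^{K}$ of (approximately) the same
size. Then, an estimator $\widehat{\alpha}_{Z_k}$ is obtained using $Z \setminus Z_k$ (i.e., all samples without $Z_k$), and the approximation error
for the hold-out samples $Z_k$ is computed as

$$\widehat{J}^{(K\text{-CV})}_{Z_k} := \frac{1}{2} \widehat{\alpha}_{Z_k}^\top \widehat{H}_{Z_k} \widehat{\alpha}_{Z_k} - \widehat{h}_{Z_k}^\top \widehat{\alpha}_{Z_k},$$

where, for $|Z_k|$ being the number of samples in subset $Z_k$,

$$[\widehat{H}_{Z_k}]_{\ell, \ell'} := \frac{1}{|Z_k|^2} \sum_{x \in Z_k} \sum_{y \in Z_k} K(x, x_\ell) L(y, y_\ell) K(x, x_{\ell'}) L(y, y_{\ell'}),$$
$$[\widehat{h}_{Z_k}]_{\ell} := \frac{1}{|Z_k|} \sum_{(x, y) \in Z_k} K(x, x_\ell) L(y, y_\ell).$$

This procedure is repeated for $k = 1, \ldots, K$, and its average $\widehat{J}^{(K\text{-CV})}$ is taken as

$$\widehat{J}^{(K\text{-CV})} := \frac{1}{K} \sum_{k=1}^{K} \widehat{J}^{(K\text{-CV})}_{Z_k}.$$

Finally, we compute $\widehat{J}^{(K\text{-CV})}$ for all model candidates, and choose the one with minimum $\widehat{J}^{(K\text{-CV})}$.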



2.3 Dependence Maximization 

Based on the empirical estimate of SMI, the dependence maximization problem is given as

$$\max_{\Pi} \widehat{\mathrm{SMI}}(Z(\Pi)).$$

We here provide a computationally efficient approximation algorithm based on dynamic programming (DP).
Let us rewrite the empirical SMI, Eq. (6), as

$$\widehat{\mathrm{SMI}}(Z(\Pi)) = \frac{1}{2m} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \delta(\pi^x_i, \pi^y_j) \, r_{\widehat{\alpha}_\Pi}(x_i, y_j) - \frac{1}{2},$$

where

$$\delta(\pi^x_i, \pi^y_j) = \begin{cases} 1 & \text{if } x_i \text{ and } y_j \text{ are paired}, \\ 0 & \text{otherwise}. \end{cases}$$

Then, the solution is updated with the current $\Pi^{\mathrm{old}}$ as

$$\Pi^{\mathrm{new}} = \operatorname*{argmax}_{\Pi} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \delta(\pi^x_i, \pi^y_j) \, r_{\widehat{\alpha}_{\Pi^{\mathrm{old}}}}(x_i, y_j). \qquad (7)$$

This problem can be efficiently solved by DP with time complexity $O(n_x n_y)$ (see Appendix). Note, however,
that the solution to Eq. (7) does not always increase the empirical SMI, Eq. (6); we update the alignment matrix
$\Pi$ only if the SMI score increases after the update.

3 Related Methods 

In this section, we review existing temporal alignment methods which are based on pairwise distance minimization 
(not dependence maximization) and point out their potential weaknesses. 

3.1 Dynamic Time Warping (DTW) 

The goal of dynamic time warping (DTW) is, given two sequences of the same dimensionality but with different
numbers of samples, $\{x_i \mid x_i \in \mathbb{R}^{d}\}_{i=1}^{n_x}$ and $\{y_j \mid y_j \in \mathbb{R}^{d}\}_{j=1}^{n_y}$, to find an alignment such that the sum of pairwise
distances between the two sets is minimized [1] [2]:

$$\min_{\pi^x, \pi^y} \sum_{t=1}^{m} \| x_{\pi^x_t} - y_{\pi^y_t} \|^2,$$

where $m$ is the number of indices needed to align the sequences. $\pi^x$ and $\pi^y$ need to satisfy the boundary,
continuity, and monotonicity conditions (see Section 2.1). The above DTW optimization problem can be efficiently
solved by DP with time complexity $O(n_x n_y)$.
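For reference, the classical DTW recursion can be written as a short dynamic program. The sketch below assumes the squared Euclidean distance as the pairwise cost and returns only the optimal accumulated cost, not the path; the function name `dtw_cost` is hypothetical.

```python
import math
from typing import Sequence


def dtw_cost(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]]) -> float:
    """Minimal accumulated squared Euclidean distance over all valid alignments."""
    n_x, n_y = len(xs), len(ys)
    # acc[i][j] is the best cost of aligning the first i samples of xs with the first j of ys.
    acc = [[math.inf] * (n_y + 1) for _ in range(n_x + 1)]
    acc[0][0] = 0.0
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            d = sum((a - b) ** 2 for a, b in zip(xs[i - 1], ys[j - 1]))
            # Allowed predecessors follow from continuity and monotonicity.
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[n_x][n_y]
```

The two nested loops visit each of the $n_x n_y$ cells once, which is where the $O(n_x n_y)$ complexity comes from. Because the cost compares raw sample values, `zip` silently assumes both sequences have the same dimensionality.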

A potential weakness of DTW is that it cannot handle sequences with different dimensions such as image 
to audio alignment. Moreover, even when the dimensionality of sequences is the same, DTW may not be able 
to find a good alignment of sequences with different characteristics such as sequences with different amplitudes. 
These drawbacks severely limit the applicability of DTW.

3.2 Canonical Time Warping (CTW)



Canonical time warping (CTW) can align sequences with different dimensions in a common latent space [9] [10].
The CTW optimization problem is given as

$$\min_{W_x, W_y, V_x, V_y} \| V_x^\top X W_x^\top - V_y^\top Y W_y^\top \|_{\mathrm{Frob}}^2, \qquad (8)$$

where $\|\cdot\|_{\mathrm{Frob}}$ is the Frobenius norm, $X = [x_1, \ldots, x_{n_x}] \in \mathbb{R}^{d_x \times n_x}$, $Y = [y_1, \ldots, y_{n_y}] \in \mathbb{R}^{d_y \times n_y}$, $W_x \in \{0, 1\}^{m \times n_x}$
and $W_y \in \{0, 1\}^{m \times n_y}$ are binary selection matrices that need to be estimated to align $X$ and $Y$, and $V_x \in \mathbb{R}^{d_x \times b}$
and $V_y \in \mathbb{R}^{d_y \times b}$ ($b \le \min(d_x, d_y)$) are linear projection matrices of $x$ and $y$ onto a common latent space,
respectively. The above optimization problem can be efficiently solved by alternately solving CCA and DTW,
where the alignment matrix obtained using DTW is usually used as initialization (initial alignment matrix).

However, since CTW finds a common latent space using CCA, it can only deal with linear and Gaussian 
temporal alignment problems. Thus, CTW cannot properly deal with multi-modal and non-Gaussian data. 
Another limitation of CTW is that comparing the alignment quality over different model parameters is not 
straightforward. This is because, for different model parameters, a common latent space found by CCA is 
generally different and thus the metric of the pairwise distance in Eq. (8) is also different. For this reason, a
systematic model selection method for the regularization parameter, dimensionality of the common latent space, 
and the initial alignment matrix has not been developed so far, to the best of our knowledge. 

3.3 Dynamic Manifold Temporal Warping (DMTW) 

Dynamic manifold temporal warping (DMTW) is a non-linear extension of CTW [4].
The DMTW optimization problem is defined as

$$\min \| \mathcal{F}_x(X) W_x^\top - \mathcal{F}_y(Y) W_y^\top \|_{\mathrm{Frob}},$$

where $\mathcal{F}_x : \mathbb{R}^{d_x \times n_x} \to \mathbb{R}^{b \times n_x}$ and $\mathcal{F}_y : \mathbb{R}^{d_y \times n_y} \to \mathbb{R}^{b \times n_y}$ are non-linear mapping functions that map $X$ and $Y$ to
a common latent subspace. DMTW first maps $X$ and $Y$ to a one-dimensional smooth manifold (i.e., $b = 1$) by
the tensor voting method [24] and then aligns the sequences on the manifold.

DMTW depends heavily on a specific non-linear transformation and requires the smooth manifold assumption.
Thus, the usage of DMTW is limited to specific applications. On the other hand, CTW and LSDTW do not
require the latter strong assumption and thus can be useful for a broader range of applications. This assertion
will be experimentally validated in the next section.

4 Experiments 

In this section, we experimentally evaluate our proposed LSDTW method on synthetic and real-world Kinect 
action recognition tasks. 

4.1 Setup 

In LSDTW, we use the Gaussian kernels:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \sigma_x^2} \right), \qquad L(y, y') = \exp\left( -\frac{\|y - y'\|^2}{2 \sigma_y^2} \right),$$

where $\sigma_x$, $\sigma_y$, and $\lambda$ are chosen by 3-fold CV from

$$(\sigma_x, \sigma_y) = c \times (m_x, m_y), \quad c = 2^{-1/2}, 1.8^{-1/2}, \ldots, 0.2^{-1/2}, \quad \lambda = 10^{-1}, 10^{-2},$$

and

$$m_x = 2^{-1/2} \operatorname{median}(\{\|x_i - x_j\|\}_{i,j=1}^{n_x}), \qquad m_y = 2^{-1/2} \operatorname{median}(\{\|y_i - y_j\|\}_{i,j=1}^{n_y}).$$

Figure 1: Results of synthetic experiments. (a-1) Multi-modal data. (a-2) Alignment paths. (a-3) SMI score as
a function of the number of iterations. (b-1) A synthetic dataset with additive exponential noise ($\eta = 2.4$).
(b-2) Alignment paths. Here, the alignment errors of LSDTW, CTW, and DTW are 5.93, 26.28, and 87.69,
respectively. (b-3) The mean alignment error over 100 runs as a function of the noise level $\eta$.

Due to the non-convex nature of the objective, setting a good initial alignment is an important issue for LSDTW.
Here, from the alignment obtained using CTW and the simple uniform initialization,

$$\pi^x = [1, \lfloor 1 + n_x / m \rfloor, \lfloor 1 + 2 n_x / m \rfloor, \ldots, n_x]^\top \in \mathbb{R}^{m \times 1},$$
$$\pi^y = [1, \lfloor 1 + n_y / m \rfloor, \lfloor 1 + 2 n_y / m \rfloor, \ldots, n_y]^\top \in \mathbb{R}^{m \times 1},$$

where $m = \min(n_x, n_y)$, we choose the one with the largest SMI score as the initial alignment for LSDTW.
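The setup quantities above are simple to compute. The sketch below shows one possible reading of the median-based scale and of the uniform initialization; the names `median_scale` and `uniform_alignment` are hypothetical, the median is taken over all ordered pairs (including $i = j$) as the index range in the formula suggests, and the uniform vectors are assumed to consist of $\lfloor 1 + k n / m \rfloor$ for $k = 0, \ldots, m - 2$ followed by the final index $n$.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def median_scale(samples: Sequence[Sequence[float]]) -> float:
    """2^(-1/2) times the median pairwise Euclidean distance over all ordered pairs."""
    dists = [math.dist(u, v) for u in samples for v in samples]
    return median(dists) / math.sqrt(2.0)


def uniform_alignment(n_x: int, n_y: int) -> Tuple[List[int], List[int]]:
    """Uniform initial alignment with m = min(n_x, n_y) index pairs (1-based)."""
    m = min(n_x, n_y)

    def indices(n: int) -> List[int]:
        # Integer arithmetic gives floor(1 + k * n / m) without rounding issues.
        return [1 + (k * n) // m for k in range(m - 1)] + [n]

    return indices(n_x), indices(n_y)
```

The kernel widths are then obtained by multiplying `median_scale(...)` by each candidate factor $c$, with the final choice left to cross-validation.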

We compare the performance of LSDTW with DTW and CTW. For CTW, we choose the dimensionality 
of CCA to preserve 90% of the total correlation, and we fix the regularization parameter at 0.01. We use the 
alignment given by DTW as the initial alignment of CTW. 

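To evaluate the alignment results, we use the following standard alignment error [10]:

$$\mathrm{Error} = \frac{\mathrm{dist}(\Pi^*, \widehat{\Pi}) + \mathrm{dist}(\widehat{\Pi}, \Pi^*)}{m^* + \widehat{m}}, \qquad \mathrm{dist}(\Pi_1, \Pi_2) = \sum_{i=1}^{m_1} \min\left( \{ \| \pi_1^{(i)} - \pi_2^{(j)} \| \}_{j=1}^{m_2} \right),$$

where $\Pi^*$ and $\widehat{\Pi}$ are the true and estimated alignment matrices (with $m^*$ and $\widehat{m}$ columns, respectively), and $\pi_1^{(i)}, \pi_2^{(j)} \in \mathbb{R}^{2 \times 1}$ are the $i$-th and $j$-th columns of
$\Pi_1$ and $\Pi_2$, respectively.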
4.2 Synthetic Dataset



First, we illustrate the behavior of the proposed LSDTW method for non-linear and non-Gaussian data using 
synthetic datasets. 

Non-linear (multi-modal) data¹:

$$x_i = z_i + 0.4 \sin(2 \pi z_i), \quad i = 1, \ldots, 1000,$$
$$y_j = z_{(j-1) \times 2 + 1}, \quad j = 1, \ldots, 500,$$

where $z_i = i / 1000$ (see Figure 1-(a-1)).
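The multi-modal dataset can be regenerated directly from the formulas above. The sketch below is only a convenience for readers; the function name `make_multimodal_data` is hypothetical, and Python lists are 0-based, so $z_{(j-1) \times 2 + 1}$ becomes `z[(j - 1) * 2]`.

```python
import math
from typing import List, Tuple


def make_multimodal_data() -> Tuple[List[float], List[float]]:
    """Generate x (1000 samples) and y (500 samples) as defined in the text."""
    z = [i / 1000 for i in range(1, 1001)]                    # z_i = i / 1000
    x = [zi + 0.4 * math.sin(2.0 * math.pi * zi) for zi in z]
    y = [z[(j - 1) * 2] for j in range(1, 501)]               # every second z value
    return x, y


x, y = make_multimodal_data()
```

Around $z = 0.5$ the sequence `x` is decreasing while `y` keeps increasing, so a single value of `x` corresponds to several values of `y`; this is the middle region referred to in the discussion of the alignment paths.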

Figure 1-(a-2) shows the alignment paths obtained by LSDTW, CTW, and DTW. In this
experiment, we initialize LSDTW and CTW with the true alignment matrix and check whether those methods perform
well. As can be observed, LSDTW finds a better alignment in the middle region (i.e., a multi-modal region)
than DTW and CTW. This suggests that the LSDTW objective is better suited to multi-modal data than the
alternatives. Figure 1-(a-3) depicts the SMI score with respect to the number of iterations in LSDTW,
showing that LSDTW converges in 10 iterations.

Non-Gaussian data:

$$X = U_x^\top Z M_x^\top + \eta E_x, \qquad Y = U_y^\top Z M_y^\top + \eta E_y,$$

where $U_x, U_y \in \mathbb{R}^{2 \times 2}$ are randomly generated affine transformation matrices, $Z \in \mathbb{R}^{2 \times m}$ is a trajectory in two
dimensions, $M_x \in \mathbb{R}^{n_x \times m}$ and $M_y \in \mathbb{R}^{n_y \times m}$ are randomly generated matrices for time warping, $E_x \in \mathbb{R}^{2 \times n_x}$
and $E_y \in \mathbb{R}^{2 \times n_y}$ are randomly generated additive exponential noise with rate parameter 1 (with its mean
adjusted to be zero), and $\eta \in \{0, 0.3, 0.6, \ldots, 3.0\}$ is the noise level. Note that a larger noise level $\eta$ means stronger
non-Gaussianity in the data.

Figures 1-(b-1) and (b-2) show an example of synthetic data with additive exponential noise ($\eta = 2.4$) and the
corresponding alignment paths obtained by LSDTW, CTW, and DTW. As can be seen, only the proposed
method finds a good alignment. Figure 1-(b-3) shows the mean alignment error over 100 runs, from which
we can confirm that the proposed method tends to outperform the existing methods for larger noise levels.



4.3 Real-world Kinect Action Recognition Data 

Next, we evaluate the proposed LSDTW method on the publicly available Kinect action recognition dataset²
[25]. This dataset consists of the human skeleton data (15 joints) obtained using a Kinect sensor, and there are
16 subjects and 16 actions with 5 runs. Instead of using the raw skeleton data, we here use the 105-dimensional 
feature vector, where each element of the feature vector is the Euclidean distance between joint pairs. 
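The dimensionality follows from counting unordered joint pairs: 15 joints give $15 \times 14 / 2 = 105$ distances. A minimal sketch of this feature map for a single frame is shown below; the function name `pairwise_joint_distances` and the pair ordering are assumptions, since the text does not specify how the pairs are ordered.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pairwise_joint_distances(joints: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance between every unordered pair of joints in one frame."""
    return [math.dist(a, b) for a, b in combinations(joints, 2)]


# A toy frame with 15 joints placed on a line, one unit apart.
toy_frame = [(float(k), 0.0, 0.0) for k in range(15)]
toy_features = pairwise_joint_distances(toy_frame)
```

Such pairwise distances do not change when the whole skeleton is translated or rotated, which is one common motivation for using them instead of raw joint coordinates.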

In evaluation, we carry out unsupervised action recognition experiments and evaluate the performance of 
alignment methods by classification accuracy. More specifically, we first divide the action recognition dataset 
into two disjoint subsets: 8 subjects (#1-#8) with all actions for testing (in total 640 sequences) and the
remaining subjects (#9-#16) with all actions for the "training" database (in total 640 sequences). Then, we retrieve
N = 10 similar actions for each test action from the database by DTW, CTW, and LSDTW. Here, we use 
the pairwise Euclidean distance based on an estimated alignment to measure the similarity between sequences. 
Finally, if there is at least one correct action in the retrieved sequences, we regard the action to be correctly 
retrieved. 
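The retrieval criterion can be made precise with a short sketch. It assumes that a matrix of alignment-based distances between test and database sequences has already been computed by one of the alignment methods; the function name `retrieval_accuracy` and the argument layout are hypothetical.

```python
from typing import List, Sequence


def retrieval_accuracy(dist: Sequence[Sequence[float]],
                       test_labels: Sequence[str],
                       db_labels: Sequence[str],
                       n_retrieved: int) -> float:
    """Fraction of test sequences with at least one same-action hit among the nearest entries.

    dist[t][k] is the distance between test sequence t and database sequence k.
    """
    hits = 0
    for row, label in zip(dist, test_labels):
        # Indices of the n_retrieved database sequences closest to this test sequence.
        ranked: List[int] = sorted(range(len(row)), key=row.__getitem__)[:n_retrieved]
        if any(db_labels[k] == label for k in ranked):
            hits += 1
    return hits / len(test_labels)
```

Under this criterion the accuracy can only stay the same or grow as the number of retrieved sequences increases, since a larger retrieved set always contains the smaller one.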

Figure 2 shows the mean classification accuracy as a function of the number of retrieved actions, N, where
three different database sizes are tested. The graphs clearly show that LSDTW compares favorably with existing 
methods in terms of classification accuracy. 

1 A distribution of a multi-modal data has two or more modes. 
2 www.cs.ucf.edu/~smasood/datasets/UCFKinect.zip
Figure 2: Mean classification accuracy with respect to the number of retrieved actions. (a) Only sequences of
subject #9 are used as database. (b) Sequences of subjects #9-#12 are used as database. (c) Sequences of all
subjects (#9-#16) are used as database.

5 Conclusions 

In this paper, we proposed a novel temporal alignment framework called the dependence maximization temporal 
alignment (DMTA) and developed a DMTA method called the least-squares dynamic time warping (LSDTW). 
LSDTW adopts squared-loss mutual information as a dependence measure, which is efficiently estimated by the 
method of least-squares mutual information. Notable advantages of LSDTW are that it can naturally deal with 
non-linear and non-Gaussian sequences and it can optimize model parameters such as the Gaussian kernel width 
and the regularization parameter by cross-validation. We applied the proposed method to the Kinect action
recognition task, and experimentally showed that LSDTW is a promising alternative to the compared methods.
Appendix

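Given the empirical estimate of SMI computed at the dependence estimation step (Sect. 2.2.2), the dependence
maximization problem is given as

$$\begin{aligned}
\max_{\Pi} \widehat{\mathrm{SMI}}(Z(\Pi)) &= \max_{\Pi} \frac{1}{2m} \sum_{i=1}^{m} r_{\widehat{\alpha}_{\Pi^{\mathrm{old}}}}(x_{\pi^x_i}, y_{\pi^y_i}) - \frac{1}{2} \\
&\overset{\text{Eq. (3)}}{=} \max_{\Pi} \frac{1}{2m} \sum_{i=1}^{m} \sum_{\ell=1}^{m^{\mathrm{old}}} \widehat{\alpha}_\ell K(x_{\pi^x_i}, x_{\pi^{x,\mathrm{old}}_\ell}) L(y_{\pi^y_i}, y_{\pi^{y,\mathrm{old}}_\ell}) - \frac{1}{2} \\
&\approx \frac{1}{2 (n_x + n_y)} \max_{(\pi^x, \pi^y)} \sum_{i=1}^{m} \sum_{\ell=1}^{m^{\mathrm{old}}} \widehat{\alpha}_\ell K(x_{\pi^x_i}, x_{\pi^{x,\mathrm{old}}_\ell}) L(y_{\pi^y_i}, y_{\pi^{y,\mathrm{old}}_\ell}) - \frac{1}{2}.
\end{aligned}$$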

Based on the constraints on the alignment functions $\Pi$ described in Sect. 2.1, this optimal alignment can be
computed by dynamic programming (DP) [11]. In order to verify this, we define the prefix sequences $X_n :=
\{x_i \mid x_i \in \mathbb{R}^{d_x}\}_{i=1}^{n}$ and $Y_{n'} := \{y_j \mid y_j \in \mathbb{R}^{d_y}\}_{j=1}^{n'}$, with $n \le n_x$ and $n' \le n_y$, and set $A(n, n') := \widehat{\mathrm{SMI}}(X_n, Y_{n'})$
denoting the optimal SMI for the aligned prefix sequences $X_n$ and $Y_{n'}$.
Following the boundary conditions of the alignment functions, we have:

$$A(1, 1) = r_{\widehat{\alpha}_{\Pi^{\mathrm{old}}}}(x_1, y_1). \qquad (10)$$

Based on the continuity and monotonicity conditions, the DP equation is given as

$$A(n, n') = \max\{A(n - 1, n' - 1), A(n - 1, n'), A(n, n' - 1)\} + r_{\widehat{\alpha}_{\Pi^{\mathrm{old}}}}(x_n, y_{n'}), \qquad (11)$$

for $1 < n \le n_x$ and $1 < n' \le n_y$. Therefore, the optimal $\widehat{\mathrm{SMI}}(Z(\Pi)) = \frac{1}{2 (n_x + n_y)} A(n_x, n_y) - \frac{1}{2}$ can be computed in
$O(n_x n_y)$. Given the accumulated cost matrix $A$, we can compute the optimal alignment $\Pi$ using backtracking.
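The recursion in Eqs. (10) and (11) together with the backtracking step can be sketched as follows. The sketch assumes that the density-ratio values $r_{\widehat{\alpha}_{\Pi^{\mathrm{old}}}}(x_i, y_j)$ have been precomputed into a matrix `score`, and it handles the first row and column by simply restricting the set of admissible predecessors, which is an assumption since the text states Eq. (11) only for interior cells. The function name `best_alignment` is hypothetical.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def best_alignment(score: List[List[float]]) -> Tuple[float, List[Cell]]:
    """Maximize the summed score along a valid alignment path; return (A(n_x, n_y), path).

    score[i][j] holds the density-ratio value for the pair (x_{i+1}, y_{j+1}).
    The returned path uses 1-based index pairs, as in the text.
    """
    n_x, n_y = len(score), len(score[0])
    acc = [[float("-inf")] * n_y for _ in range(n_x)]
    back: List[List[Optional[Cell]]] = [[None] * n_y for _ in range(n_x)]
    for i in range(n_x):
        for j in range(n_y):
            if i == 0 and j == 0:
                acc[0][0] = score[0][0]                     # Eq. (10)
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = max(candidates, key=lambda c: c[0])
            acc[i][j] = best + score[i][j]                  # Eq. (11)
            back[i][j] = prev
    # Backtrack from the final cell to recover the alignment.
    path: List[Cell] = []
    cell: Optional[Cell] = (n_x - 1, n_y - 1)
    while cell is not None:
        path.append((cell[0] + 1, cell[1] + 1))
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return acc[n_x - 1][n_y - 1], path
```

The approximate SMI score of the recovered alignment is then `total / (2 * (n_x + n_y)) - 0.5`, where `total` is the first returned value, and the update is accepted only if the re-estimated SMI increases.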